Companion to Assessing shop completeness in OpenStreetMap for two federalstates in Germany - Analysis of driving factors for shop completeness

The data set used for the analysis contains the following columns:

Explorative analysis

# diagram shops and estimated asymptote
p1 <- ggplot(TabFactors, aes(x=avShops, y=asymptote, text=district)) + theme_minimal()  + labs(title="number of shops and asymptote") + ylab("asymptote") + xlab("number of shops") + 
  geom_point(lwd=0.5, alpha=0.8, aes(colour=districtTypeF)) +
  geom_abline(slope=1, intercept = 0, col="grey", lty=2) +
  theme(plot.title = element_text(size=10, face = "bold"))+ theme(axis.title.y=element_text(size=9)) +theme(axis.title.x=element_text(size=9))+ theme(legend.text = element_text(size = 9)) + theme(axis.text.y =  element_text(size = 7), axis.text.x =  element_text(size =7)) +
  #theme(legend.position = "bottom") +
  theme(legend.title = element_text(size = 9, face = "bold")) 

ggplotly(p1, Tooltip=c("district"))
#violin plot
ggplot(TabFactors, aes(x=districtTypeF, y=completeness, colour = districtTypeF)) + geom_violin(size=0.3) + 
  scale_fill_brewer(palette="Dark2") + geom_boxplot(width=0.1, size = 0.3)+
  #stat_summary(fun.y=median, geom="point", size=2, color="red")+
  #stat_summary(fun.y=mean, geom="point", shape=23, size=2)+
  labs(title = "completeness distribution by district type", y = "completeness level [%]", x="")+ theme(legend.position = "none") +  theme(plot.title = element_text(size=10, face="bold")) + theme(axis.title.y=element_text(size=9)) + theme(legend.text = element_text(size = 9)) + theme(axis.text.y =  element_text(size = 7), axis.text.x =  element_text(size =9, colour="black"))+
  geom_dotplot(binaxis='y', stackdir='center',position=position_dodge(1), binwidth=0.5, colour = "Black")

For interpretation on might want to take into account that more urban areas have also a higher unemployment rate:

ggplot(TabFactors, aes(x=districtTypeF, y= unemployments)) + geom_violin()

gUenmp <- lm(unemployments ~ districtTypeF, data= TabFactors)
summary(gUenmp)
## 
## Call:
## lm(formula = unemployments ~ districtTypeF, data = TabFactors)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1286 -0.5714 -0.5348  0.7061  4.4714 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       4.5286     0.4022  11.258 4.05e-14 ***
## districtTypeFurban districts     -0.9938     0.5102  -1.948   0.0583 .  
## districtTypeFindependent cities   1.0429     0.6967   1.497   0.1421    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.505 on 41 degrees of freedom
## Multiple R-squared:  0.2109, Adjusted R-squared:  0.1724 
## F-statistic:  5.48 on 2 and 41 DF,  p-value: 0.007777

However, differences are not strongly significant.

Regression analysis

A poisson GLM estimates significant coefficients for all four predictors considered in the analysis:

g <- glm(asymptote ~ districtTypeF + GDP + unemployments + academics + offset(log(avShops)),  data=TabFactors, family=poisson)

## deviance, estimates and significance
1-g$deviance/g$null.deviance
## [1] 0.2112687
summary(g)
## 
## Call:
## glm(formula = asymptote ~ districtTypeF + GDP + unemployments + 
##     academics + offset(log(avShops)), family = poisson, data = TabFactors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.6724  -2.4469  -0.8858   1.1783  19.8407  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.2315733  0.0463588  -4.995 5.88e-07 ***
## districtTypeFurban districts     0.0507730  0.0127542   3.981 6.87e-05 ***
## districtTypeFindependent cities -0.0764611  0.0193815  -3.945 7.98e-05 ***
## GDP                              0.0040363  0.0005294   7.624 2.47e-14 ***
## unemployments                    0.0482830  0.0042124  11.462  < 2e-16 ***
## academics                       -0.0118853  0.0020594  -5.771 7.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1041.39  on 43  degrees of freedom
## Residual deviance:  821.38  on 38  degrees of freedom
## AIC: 1217
## 
## Number of Fisher Scoring iterations: 4
## VIF

vif(g)
##                   GVIF Df GVIF^(1/(2*Df))
## districtTypeF 4.030361  2        1.416890
## GDP           2.307922  1        1.519185
## unemployments 2.635076  1        1.623292
## academics     3.706657  1        1.925268

Adjust for overdispersion

However, to account for overdispersion in the data we have to use a negative binomial GLM. This reduces the number of significant predictors due to the additional uncertainty that was (wrongly) ignored by the poisson GLM.

g <- glm.nb(asymptote ~ districtTypeF + GDP + unemployments + academics + offset(log(avShops)),  data=TabFactors)

## deviance, estimates and significance
1-g$deviance/g$null.deviance
## [1] 0.2581558
summary(g)
## 
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + GDP + unemployments + 
##     academics + offset(log(avShops)), data = TabFactors, init.theta = 60.3149139, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9202  -0.6626  -0.2318   0.3660   4.1812  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.356284   0.218597  -1.630  0.10313    
## districtTypeFurban districts     0.020293   0.049716   0.408  0.68314    
## districtTypeFindependent cities -0.122508   0.094728  -1.293  0.19592    
## GDP                              0.005399   0.002640   2.045  0.04084 *  
## unemployments                    0.061954   0.018100   3.423  0.00062 ***
## academics                       -0.011840   0.009886  -1.198  0.23107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(60.3149) family taken to be 1)
## 
##     Null deviance: 59.045  on 43  degrees of freedom
## Residual deviance: 43.802  on 38  degrees of freedom
## AIC: 567.14
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  60.3 
##           Std. Err.:  13.7 
## 
##  2 x log-likelihood:  -553.14
## VIF

vif(g)
##                   GVIF Df GVIF^(1/(2*Df))
## districtTypeF 2.952698  2        1.310855
## GDP           1.900122  1        1.378449
## unemployments 2.201008  1        1.483579
## academics     2.984513  1        1.727574

Need to simplify model:

g <- update(g, ~. -academics)
summary(g)
## 
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + GDP + unemployments + 
##     offset(log(avShops)), data = TabFactors, init.theta = 58.22614057, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7917  -0.6318  -0.2090   0.3476   4.2538  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                     -0.335472   0.221280  -1.516   0.1295   
## districtTypeFurban districts     0.003967   0.048742   0.081   0.9351   
## districtTypeFindependent cities -0.199261   0.072656  -2.743   0.0061 **
## GDP                              0.004311   0.002510   1.717   0.0859 . 
## unemployments                    0.054463   0.017052   3.194   0.0014 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(58.2261) family taken to be 1)
## 
##     Null deviance: 57.131  on 43  degrees of freedom
## Residual deviance: 43.769  on 39  degrees of freedom
## AIC: 566.55
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  58.2 
##           Std. Err.:  13.2 
## 
##  2 x log-likelihood:  -554.555
vif(g)
##                   GVIF Df GVIF^(1/(2*Df))
## districtTypeF 1.676818  2        1.137945
## GDP           1.660710  1        1.288685
## unemployments 1.888730  1        1.374311
drop1(g)
## Single term deletions
## 
## Model:
## asymptote ~ districtTypeF + GDP + unemployments + offset(log(avShops))
##               Df Deviance    AIC
## <none>             43.769 564.55
## districtTypeF  2   52.484 569.27
## GDP            1   46.724 565.51
## unemployments  1   54.119 572.90

(This is the model used in the manuscript.)

g <- update(g, ~. -GDP)
summary(g)
## 
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + unemployments + 
##     offset(log(avShops)), data = TabFactors, init.theta = 54.48131943, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5501  -0.6844  -0.2439   0.3335   4.2223  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                      0.026062   0.076401   0.341  0.73301   
## districtTypeFurban districts     0.008873   0.050061   0.177  0.85932   
## districtTypeFindependent cities -0.140452   0.065857  -2.133  0.03295 * 
## unemployments                    0.037443   0.014481   2.586  0.00972 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(54.4813) family taken to be 1)
## 
##     Null deviance: 53.677  on 43  degrees of freedom
## Residual deviance: 43.894  on 40  degrees of freedom
## AIC: 567.42
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  54.5 
##           Std. Err.:  12.3 
## 
##  2 x log-likelihood:  -557.421
vif(g)
##                   GVIF Df GVIF^(1/(2*Df))
## districtTypeF 1.278499  2        1.063347
## unemployments 1.278499  1        1.130707
1 - g$deviance / g$null.deviance
## [1] 0.1822509

Compare with model with GDP instead of unemployment rate

g2 <- glm.nb(formula = asymptote ~ districtTypeF + GDP + offset(log(avShops)), data = TabFactors, link = log)
summary(g2)
## 
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + GDP + offset(log(avShops)), 
##     data = TabFactors, link = log, init.theta = 46.79504818)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2052  -0.6465  -0.2734   0.2403   4.8963  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      0.2297373  0.1553730   1.479    0.139
## districtTypeFurban districts    -0.0331359  0.0527921  -0.628    0.530
## districtTypeFindependent cities -0.1034678  0.0727771  -1.422    0.155
## GDP                             -0.0004187  0.0022939  -0.183    0.855
## 
## (Dispersion parameter for Negative Binomial(46.795) family taken to be 1)
## 
##     Null deviance: 46.498  on 43  degrees of freedom
## Residual deviance: 44.057  on 40  degrees of freedom
## AIC: 573.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  46.8 
##           Std. Err.:  10.5 
## 
##  2 x log-likelihood:  -563.9

Clearly worst, so we stick with g. Are were interactions?

g2 <- glm.nb(formula = asymptote ~ districtTypeF * unemployments + offset(log(avShops)), data = TabFactors, link = log)
summary(g2)
## 
## Call:
## glm.nb(formula = asymptote ~ districtTypeF * unemployments + 
##     offset(log(avShops)), data = TabFactors, link = log, init.theta = 55.99395915)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7989  -0.6115  -0.2448   0.2852   3.9883  
## 
## Coefficients:
##                                               Estimate Std. Error z value
## (Intercept)                                   -0.01953    0.08769  -0.223
## districtTypeFurban districts                   0.15718    0.14882   1.056
## districtTypeFindependent cities               -0.01963    0.24440  -0.080
## unemployments                                  0.04732    0.01728   2.738
## districtTypeFurban districts:unemployments    -0.03902    0.03730  -1.046
## districtTypeFindependent cities:unemployments -0.02340    0.04344  -0.539
##                                               Pr(>|z|)   
## (Intercept)                                    0.82373   
## districtTypeFurban districts                   0.29090   
## districtTypeFindependent cities                0.93598   
## unemployments                                  0.00618 **
## districtTypeFurban districts:unemployments     0.29544   
## districtTypeFindependent cities:unemployments  0.59021   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(55.994) family taken to be 1)
## 
##     Null deviance: 55.075  on 43  degrees of freedom
## Residual deviance: 43.816  on 38  degrees of freedom
## AIC: 570.21
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  56.0 
##           Std. Err.:  12.6 
## 
##  2 x log-likelihood:  -556.213

Nope.

Diagnostics

plot(g, which=1:6, labels.id = TabFactors$district)

Nordsachsen is highly influential, characterized by a high cook’s distance and a high leverage. Görlitz is also remarkable due to its high leverage.

What if we leave out Nordsachsen?

#TabFactors$district
gPart <- glm.nb(asymptote ~ unemployments +  offset(log(avShops)),  data=TabFactors, subset= district != "Nordsachsen, rural district" )

## deviance, estimates and significance
1-gPart$deviance/gPart$null.deviance
## [1] 0.008840667
summary(gPart)
## 
## Call:
## glm.nb(formula = asymptote ~ unemployments + offset(log(avShops)), 
##     data = TabFactors, subset = district != "Nordsachsen, rural district", 
##     init.theta = 121.1133921, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4628  -0.7879  -0.2650   0.6885   2.8794  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.167831   0.041805   4.015 5.95e-05 ***
## unemployments -0.005804   0.009459  -0.614    0.539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(121.1134) family taken to be 1)
## 
##     Null deviance: 43.223  on 42  degrees of freedom
## Residual deviance: 42.841  on 41  degrees of freedom
## AIC: 518.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  121.1 
##           Std. Err.:  29.8 
## 
##  2 x log-likelihood:  -512.697

If we drop Görlitz as another district with high leverage, the relationship with unemployment holds.

gPart <- glm.nb(asymptote ~ unemployments +  offset(log(avShops)),  data=TabFactors, subset= district != "Nordsachsen, rural district" | district != "Görlitz, rural district")

## deviance, estimates and significance
1-gPart$deviance/gPart$null.deviance
## [1] 0.07963517
summary(gPart)
## 
## Call:
## glm.nb(formula = asymptote ~ unemployments + offset(log(avShops)), 
##     data = TabFactors, subset = district != "Nordsachsen, rural district" | 
##         district != "Görlitz, rural district", init.theta = 48.01488863, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2442  -0.6825  -0.2325   0.3200   4.4812  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    0.05807    0.06134   0.947   0.3438  
## unemployments  0.02571    0.01361   1.890   0.0588 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48.0149) family taken to be 1)
## 
##     Null deviance: 47.645  on 43  degrees of freedom
## Residual deviance: 43.851  on 42  degrees of freedom
## AIC: 568.62
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48.0 
##           Std. Err.:  10.7 
## 
##  2 x log-likelihood:  -562.621

If we only drop Görlitz

gPart <- glm.nb(asymptote ~ districtTypeF + unemployments +  offset(log(avShops)),  data=TabFactors, subset= district != "Nordsachsen, rural district" | district != "Görlitz, rural district")

## deviance, estimates and significance
1-gPart$deviance/gPart$null.deviance
## [1] 0.1822509
summary(gPart)
## 
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + unemployments + 
##     offset(log(avShops)), data = TabFactors, subset = district != 
##     "Nordsachsen, rural district" | district != "Görlitz, rural district", 
##     init.theta = 54.48131943, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5501  -0.6844  -0.2439   0.3335   4.2223  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                      0.026062   0.076401   0.341  0.73301   
## districtTypeFurban districts     0.008873   0.050061   0.177  0.85932   
## districtTypeFindependent cities -0.140452   0.065857  -2.133  0.03295 * 
## unemployments                    0.037443   0.014481   2.586  0.00972 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(54.4813) family taken to be 1)
## 
##     Null deviance: 53.677  on 43  degrees of freedom
## Residual deviance: 43.894  on 40  degrees of freedom
## AIC: 567.42
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  54.5 
##           Std. Err.:  12.3 
## 
##  2 x log-likelihood:  -557.421

Coefficient estimates stay the same if Görlitz is excluded or not.

Districts for those no reliable saturation level could be estimated

gSat <- glm(noSaturationEstimated ~ districtTypeF + unemployments + GDP, data=TabFactorsAll, family=binomial)
summary(gSat)
## 
## Call:
## glm(formula = noSaturationEstimated ~ districtTypeF + unemployments + 
##     GDP, family = binomial, data = TabFactorsAll)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0738  -0.6984  -0.6634  -0.5518   1.8746  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)                     -1.735336   3.266384  -0.531    0.595
## districtTypeFurban districts     0.398078   0.878130   0.453    0.650
## districtTypeFindependent cities  0.944409   0.989588   0.954    0.340
## unemployments                    0.104953   0.272542   0.385    0.700
## GDP                             -0.005128   0.035826  -0.143    0.886
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61.210  on 56  degrees of freedom
## Residual deviance: 59.523  on 52  degrees of freedom
## AIC: 69.523
## 
## Number of Fisher Scoring iterations: 4
drop1(gSat)
## Single term deletions
## 
## Model:
## noSaturationEstimated ~ districtTypeF + unemployments + GDP
##               Df Deviance    AIC
## <none>             59.523 69.523
## districtTypeF  2   60.452 66.452
## unemployments  1   59.670 67.670
## GDP            1   59.544 67.544
gSat <- update(gSat, ~.- districtTypeF)
summary(gSat)
## 
## Call:
## glm(formula = noSaturationEstimated ~ unemployments + GDP, family = binomial, 
##     data = TabFactorsAll)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9813  -0.7163  -0.6530  -0.6292   1.8313  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -2.593145   2.852478  -0.909    0.363
## unemployments  0.182698   0.214270   0.853    0.394
## GDP            0.008218   0.031821   0.258    0.796
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61.210  on 56  degrees of freedom
## Residual deviance: 60.452  on 54  degrees of freedom
## AIC: 66.452
## 
## Number of Fisher Scoring iterations: 4
drop1(gSat)
## Single term deletions
## 
## Model:
## noSaturationEstimated ~ unemployments + GDP
##               Df Deviance    AIC
## <none>             60.452 66.452
## unemployments  1   61.169 65.169
## GDP            1   60.517 64.517
gSat <- update(gSat, ~.- unemployments)
summary(gSat)
## 
## Call:
## glm(formula = noSaturationEstimated ~ GDP, family = binomial, 
##     data = TabFactorsAll)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7560  -0.7280  -0.7152  -0.6779   1.8189  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.819006   2.003200  -0.409    0.683
## GDP         -0.005747   0.028487  -0.202    0.840
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61.210  on 56  degrees of freedom
## Residual deviance: 61.169  on 55  degrees of freedom
## AIC: 65.169
## 
## Number of Fisher Scoring iterations: 4
drop1(gSat)
## Single term deletions
## 
## Model:
## noSaturationEstimated ~ GDP
##        Df Deviance    AIC
## <none>      61.169 65.169
## GDP     1   61.210 63.210
ggplot(TabFactorsAll, aes(x=noSaturationEstimated, y= GDP)) + geom_boxplot(varwidth=TRUE) + facet_wrap(~districtTypeF)

ggplot(TabFactorsAll, aes(x=noSaturationEstimated, y= unemployments)) + geom_boxplot(varwidth=TRUE) + facet_wrap(~districtTypeF)